html_summary <- function(model, title = "Model Summary") {
summary_text <- capture.output(summary(model))
cat(glue::glue("<h4>{title}</h4>"))
cat('<div style="border:1px solid #ccc; padding:10px; max-height:300px; overflow:auto; font-family:monospace; background:#f9f9f9;">')
cat("<pre>")
cat(paste(summary_text, collapse = "\n"))
cat("</pre>")
cat("</div>")
}
# Load data
load("../data/Residen.RData")
Data <- Residen
# Extract variables by category
time_vars <- colnames(Data)[1:4] # Time variables (START YEAR, START QUARTER, etc.)
project_vars <- paste0("V", 1:8) # Project physical/financial variables (V1-V8)
economic_lag_vars <- list( # Define economic lag groups (Lag1-Lag5)
Lag1 = paste0("V", 9:27), # V9-V27
Lag2 = paste0("V", 28:46), # V28-V46
Lag3 = paste0("V", 47:65), # V47-V65
Lag4 = paste0("V", 66:84), # V66-V84
Lag5 = paste0("V", 85:103)) # V85-V103
output_vars <- c("V104", "V105") # Output variables (sales price and construction costs)
# Define a Reusable Function for Correlation Heatmaps
plot_cor_heatmap <- function(data, vars, title = "Correlation Heatmap",
k_col = 2, k_row = 2,
fontsize_row = 8, fontsize_col = 8,
width = 800, height = 600) {
# Define color gradient: red-white-blue
custom_palette <- colorRampPalette(c(
"#FF0000", # Red (for -1)
"#FFFFFF", # White (for 0)
"#00008B" # Dark blue (for 1.0)
))(50)
# Extract subset of variables
selected_vars <- data[, vars, drop = FALSE]
# Calculate correlation matrix
cor_matrix <- cor(selected_vars, use = "complete.obs")
# Plot interactive heatmap
heatmaply(
cor_matrix,
colors = custom_palette, # Apply custom color
limits = c(-1, 1), # Force color range to [-1,1]
k_col = k_col,
k_row = k_row,
fontsize_row = fontsize_row,
fontsize_col = fontsize_col,
main = title,
show_dendrogram = c(TRUE, TRUE), # Show row/column dendrograms
width = width,
height = height
)
}
1.Time Variables vs. Outputs
# Analyze Sales Price (V104) by starting year
ggplot(Data, aes(x = as.factor(`START YEAR`), y = V104)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Sales Price by Starting Year", x = "Year", y = "Sales Price (V104)")
# Analyze Sales Price (V104) by starting quarter
ggplot(Data, aes(x = as.factor(`START QUARTER`), y = V104)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Sales Price by Starting Quarter", x = "Quarter", y = "Sales Price (V104)")
# Analyze Construction Costs (V105) by starting year
ggplot(Data, aes(x = as.factor(`START YEAR`), y = V105)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Construction Costs by Starting Year", x = "Year", y = "Construction Costs (V105)")
# Analyze Construction Costs (V105) by starting quarter
ggplot(Data, aes(x = as.factor(`START QUARTER`), y = V105)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Construction Costs by Starting Quarter", x = "Quarter", y = "Construction Costs (V105)")
cor_time_output <- plot_cor_heatmap(
data = Data,
vars = c(time_vars, output_vars),
title = " Correlation: Time Variables vs. Outputs")
cor_time_output
Time Variables vs. Output Variables – Observations
From the heatmap and boxplots, several patterns emerge:
The heatmap shows almost a perfect positive correlation (r = 0.988) between start year and completion year This makes sense, as construction durations are relatively stable, leading to highly aligned project start and completion times.
Actual Sales Price (V104) and Construction Cost (V105) correlate strong (r = 0.796), confirming that higher costs typically drive higher sales prices.
The boxplots show an overall upward trend in both sales prices and construction costs across years, which is consistent with inflation and general increases in construction expenses over time.
The heatmap further reveals:
Start Year is moderately correlated with Actual Sales Price (r = 0.607) and strongly correlated with Actual Construction Cost (r = 0.753).
Completion year shows similar correlations: r = 0.613 with sales price and r = 0.769 with construction cost.
When examining seasonal effects, Sales Prices are fairly stable across quarters, though Quarter 1 shows a slightly higher median. In contrast, Construction Costs tend to be higher in Quarter 1 and Quarter 4, possibly due to seasonal budgeting needs or increased costs during colder months.
2.Project Variables vs. Outputs
cor_project_output <- plot_cor_heatmap(
data = Data,
vars = c(project_vars, output_vars),
title = " Correlation: Project Variables vs. Outputs")
cor_project_output
Projects Variables vs. Output Variables – Observations
Based on the correlation heatmap and the variables described, here’s some pattern I found:
The correlation between Actual Sales Price (V104) and preliminary estimated construction cost (V5) is r = 0.785, while the correlation between V104 and initial unit price (V8) is extremely high at r = 0.976. This suggests that initial unit price(V8) is an excellent predictor of final sales prices(V104). For Actual Construction Costs (V105), the correlations with preliminary cost estimates vary: r = 0.602 with V4 (total preliminary estimated cost) and r = 0.963 with V5 (preliminary estimated construction cost). This indicates that V5 is a much more reliable predictor of v105. V105 (Actual Construction Cost) also has a strong correlation with initial unit price (V8) at r = 0.790. Project locality (V1) shows negative correlations with most variables, suggesting location factors influence costs and prices inversely. The correlation between Total floor area (V2) and Lot area (V3) is r = 0.947, indicating a very strong linear relationship. This makes sense, as larger lots typically accommodate buildings with greater floor area.
The clustering in the dendrogram shows V104 and V8 ,V105 and V5, grouped closely together, confirming their strong relationship.
The data demonstrates that preliminary cost estimates (particularly V5) and initial pricing (V8) are highly reliable predictors of final project outcomes, with correlation coefficients above 0.75 in most key relationships.
# Step 1: Save all heatmaply objects into a list
heatmap_list <- lapply(names(economic_lag_vars), function(lag_name) {
current_vars <- c(economic_lag_vars[[lag_name]], output_vars)
plot_cor_heatmap(
data = Data,
vars = current_vars,
title = paste("Economic Variables:", lag_name),
k_col = 3,
k_row = 3
)
})
# Step 2: Show them one by one
heatmap_list[[1]]
heatmap_list[[2]]
heatmap_list[[3]]
heatmap_list[[4]]
heatmap_list[[5]]
Economic Variables vs. Output Variables – Observations
Based on the five heatmap visualizations showing economic variables at different time lags (Lag1 through Lag5), I found some interesting patterns about correlations between the economic indicators and project outputs:
1.Interest Rate Effects (V18/V37/V56/V75/V94) consistently show strong negative correlations (deep red) with most other economic variables across all time lags. The negative correlation persists across all five time lags, suggesting a consistent long-term relationshipion persists across all five time lags, suggesting a consistent long-term relationship
2.Throughout all five time periods, variables V10 (Building services index), V11 (Wholesale price index of building materials), V13 (Cumulative liquidity), V14 (Private sector investment in new buildings), V15 (Land price index), V19 (Average construction cost at completion), V20 (Average construction cost at beginning), V23 (Consumer price index), V24 (CPI of housing, water, fuel & power), and V27 (Gold price per ounce) demonstrate extraordinarily high correlation coefficients with each other. These correlation values range from a minimum of 0.88 to a maximum of 0.999, approaching perfect correlation. Not only do these variables show strong internal correlations among themselves, but they also consistently exhibit similar positive correlations with the project output variables V104 (Actual sales prices) and V105 (Actual construction costs) across all time lags. This consistent pattern suggests these ten economic indicators move almost in unison and have a uniformly positive influence on both construction costs and sales prices throughout different time periods. The stability of these relationships across all time lags indicates they represent fundamental and reliable economic drivers in the construction and real estate markets.
3.The clustering in the dendrogram shows V9 (number of building permits issued) and V12 (total floor areas of building permits issued), as well as V16 (number of loans extended by banks) and V26 (population of the city), grouped closely together, confirming their strong relationships. The color between each pair is deeper blue, while being lighter with other variables, suggesting strong positive correlations within these two groups and relatively weak relationships with other variables. This pattern logically reflects that building permit counts and the total floor area of those permits are inherently linked, while the number of loans extended by banks naturally correlates with the city’s population size.
lr_model <- lm(V104~ . - V105,data = Data)
#summary(lr_model)
html_summary(lr_model)
Call:
lm(formula = V104 ~ . - V105, data = Data)
Residuals:
Min 1Q Median 3Q Max
-901.15 -52.29 -1.50 45.71 645.31
Coefficients: (32 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.004e+05 1.999e+05 0.502 0.615781
`START YEAR` -1.614e+03 3.388e+03 -0.476 0.634175
`START QUARTER` -4.928e+02 2.457e+03 -0.201 0.841139
`COMPLETION YEAR` 1.521e+02 1.689e+01 9.007 < 2e-16 ***
`COMPLETION QUARTER` 5.984e+01 8.881e+00 6.738 8.36e-11 ***
V1 -4.779e+00 2.225e+00 -2.148 0.032529 *
V2 6.630e-02 2.195e-02 3.020 0.002746 **
V3 -2.331e-01 6.451e-02 -3.612 0.000356 ***
V4 6.442e-03 3.570e-02 0.180 0.856905
V5 -6.548e-01 3.342e-01 -1.960 0.050975 .
V6 8.764e-02 6.322e-02 1.386 0.166727
V7 NA NA NA NA
V8 1.203e+00 1.681e-02 71.579 < 2e-16 ***
V9 2.016e-01 1.103e+00 0.183 0.855159
V10 1.373e+02 9.675e+02 0.142 0.887245
V11 1.048e+01 7.426e+01 0.141 0.887830
V12 -1.626e+02 6.837e+02 -0.238 0.812200
V13 1.329e-04 1.125e-01 0.001 0.999058
V14 1.799e-01 4.339e-01 0.415 0.678643
V15 -2.265e+01 5.008e+01 -0.452 0.651433
V16 8.696e-01 2.753e+00 0.316 0.752315
V17 -4.247e-03 1.860e-01 -0.023 0.981801
V18 2.328e+02 1.235e+03 0.188 0.850672
V19 -2.007e-01 4.647e+00 -0.043 0.965587
V20 -3.910e-01 1.187e+00 -0.329 0.742147
V21 6.422e-02 3.565e-01 0.180 0.857174
V22 3.682e-03 4.524e-01 0.008 0.993511
V23 6.413e+00 6.128e+02 0.010 0.991658
V24 4.270e+01 2.347e+02 0.182 0.855747
V25 5.267e-02 1.033e-01 0.510 0.610369
V26 -4.675e-04 4.268e-01 -0.001 0.999127
V27 9.455e-04 9.654e-03 0.098 0.922046
V28 5.915e-02 1.795e-01 0.330 0.741930
V29 -1.144e+02 8.626e+02 -0.133 0.894615
V30 -7.247e+00 1.452e+02 -0.050 0.960240
V31 -9.085e+01 3.036e+02 -0.299 0.764953
V32 -1.780e-03 8.287e-02 -0.021 0.982879
V33 -5.666e-02 3.707e-01 -0.153 0.878630
V34 -9.607e+00 1.077e+02 -0.089 0.929008
V35 9.323e-01 1.976e+00 0.472 0.637449
V36 1.490e-02 8.094e-02 0.184 0.854034
V37 7.421e+01 6.343e+02 0.117 0.906942
V38 -4.162e-01 6.246e+00 -0.067 0.946913
V39 6.458e-01 2.013e+00 0.321 0.748563
V40 -8.375e-02 1.877e-01 -0.446 0.655784
V41 1.615e-01 5.671e-01 0.285 0.776052
V42 1.890e+02 6.380e+02 0.296 0.767239
V43 -1.891e+02 8.982e+02 -0.211 0.833373
V44 -1.202e-02 5.531e-01 -0.022 0.982673
V45 -1.184e-02 3.554e-01 -0.033 0.973441
V46 -1.038e-03 8.216e-03 -0.126 0.899575
V47 1.040e-01 2.590e-01 0.402 0.688186
V48 1.119e+02 1.334e+03 0.084 0.933224
V49 -3.006e+01 4.980e+01 -0.604 0.546571
V50 -1.746e+02 4.440e+02 -0.393 0.694395
V51 6.526e-03 1.409e-01 0.046 0.963102
V52 -7.985e-02 4.933e-01 -0.162 0.871519
V53 6.439e+00 1.296e+02 0.050 0.960422
V54 2.405e+00 4.657e+00 0.516 0.605918
V55 -1.698e-02 1.631e-01 -0.104 0.917133
V56 3.089e+01 2.974e+02 0.104 0.917333
V57 -2.722e-01 2.233e+00 -0.122 0.903072
V58 3.539e-01 3.079e+00 0.115 0.908564
V59 -9.660e-02 5.540e-01 -0.174 0.861684
V60 -2.140e-01 1.307e+00 -0.164 0.870070
V61 4.198e+01 1.636e+02 0.257 0.797647
V62 2.040e+02 1.364e+03 0.150 0.881170
V63 -1.273e-01 8.479e-01 -0.150 0.880716
V64 -2.178e-02 3.772e-01 -0.058 0.953994
V65 -9.490e-04 8.072e-03 -0.118 0.906483
V66 6.260e-02 6.990e-01 0.090 0.928700
V67 -2.018e+02 7.758e+02 -0.260 0.794924
V68 6.947e+01 7.047e+01 0.986 0.324982
V69 1.248e+02 1.397e+02 0.894 0.372099
V70 -9.328e-03 4.437e-02 -0.210 0.833629
V71 -1.346e-01 4.248e-01 -0.317 0.751583
V72 -1.492e+01 2.247e+02 -0.066 0.947118
V73 NA NA NA NA
V74 NA NA NA NA
V75 NA NA NA NA
V76 NA NA NA NA
V77 NA NA NA NA
V78 NA NA NA NA
V79 NA NA NA NA
V80 NA NA NA NA
V81 NA NA NA NA
V82 NA NA NA NA
V83 NA NA NA NA
V84 NA NA NA NA
V85 NA NA NA NA
V86 NA NA NA NA
V87 NA NA NA NA
V88 NA NA NA NA
V89 NA NA NA NA
V90 NA NA NA NA
V91 NA NA NA NA
V92 NA NA NA NA
V93 NA NA NA NA
V94 NA NA NA NA
V95 NA NA NA NA
V96 NA NA NA NA
V97 NA NA NA NA
V98 NA NA NA NA
V99 NA NA NA NA
V100 NA NA NA NA
V101 NA NA NA NA
V102 NA NA NA NA
V103 NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 148.6 on 296 degrees of freedom
Multiple R-squared: 0.9879, Adjusted R-squared: 0.9848
F-statistic: 321.9 on 75 and 296 DF, p-value: < 2.2e-16
From the summary of the linear regression model, we can find that:
1.Residuals: Residuals are roughly symmetric around zero (median = -1.50), indicating indicating unbiased predictions on average.However, residuals range widely (-901.15 to 645.31), suggesting the model struggles with extreme values.
2.Coefficients and Significance: COMPLETION YEAR (β = 152.1, p < 0.001) and COMPLETION QUARTER (β = 59.84, p < 0.001): These two variables both have a significant positive impact on the model, indicating that the completion year and quarter are crucial predictors of the outcome. V3(β = -0.2331, p < 0.001) and V8(β = 1.203, p < 0.001):V3 has a significant negative impact, while V8 has a significant positive impact,, with V3 stands for Lot area and V8 to Price of the unit at the beginning of the project per m2. V1(β = -4.779, p < 0.05) and V2(β = 0.0663, p < 0.05):V1 has a significant negative impact, while V2 has a significant positive impact. at the 0.05 level, with V1 stands for project locality and V2 to total floor area of the buildings. V5(β = -0.6548, p < 0.1) :V5 has a marginally significant negative impact,stands for Preliminary estimated construction cost based on the prices at the beginning of the project. The linear regression model supports the findings from my previous data visualization analysis.
3.Model Performance: Adjusted R-squared: 0.9848: This tells us that the model explains about 98.48% of the changes in actual sales prices (V104). This shows the model is a good fit for the data.But High R-squared with excessive predictors (75 variables) may harm generalizability. We need careful about the overfitting risk for this model. F-statistic = 321.9 with a p-value < 2.2e-16, indicating that Model is statistically significant overall.
4.Multicollinearity and undefined coefficients: There are undefined coefficients for 32 predictor variables. This points to a multicollinearity issue. This means some variables are very closely related, making it hard to tell their separate effects. We can use VIF to check for multicollinearity and think about removing or combining variables that are highly correlated.
# Split the data
set.seed(123) # Set seed for reproducibility
train_index <- sample(1:nrow(Data), 0.8 * nrow(Data))
train_data <- Data[train_index, ]
test_data <- Data[-train_index, ]
# Define full model
full_model <- lm(V104 ~ . -V105, data = train_data)
# (a) Backwards Selection
start_time_backward <- Sys.time()
backward_model <- stepAIC(full_model, direction = "backward", trace = 0)
end_time_backward <- Sys.time()
backward_time <- end_time_backward - start_time_backward
backward_pred <- predict(backward_model, newdata = test_data)
backward_mse <- mean((backward_pred - test_data$V104)^2)
# (b) Stepwise Selection
start_time_stepwise <- Sys.time()
stepwise_model <- stepAIC(full_model, direction = "both", trace = 0)
end_time_stepwise <- Sys.time()
stepwise_time <- end_time_stepwise - start_time_stepwise
stepwise_pred <- predict(stepwise_model, newdata = test_data)
stepwise_mse <- mean((stepwise_pred - test_data$V104)^2)
# Compare models
comparison_results <- data.frame(
Model = c("Backward", "Stepwise"),
Variables = c(length(coef(backward_model)), length(coef(stepwise_model))),
R_Squared = c(summary(backward_model)$r.squared, summary(stepwise_model)$r.squared),
Adj_R2 = c(summary(backward_model)$adj.r.squared, summary(stepwise_model)$adj.r.squared),
Time = c(backward_time, stepwise_time),
AIC = c(AIC(backward_model), AIC(stepwise_model)),
BIC = c(BIC(backward_model), BIC(stepwise_model)),
Holdout_MSE = c(backward_mse, stepwise_mse)
)
print(comparison_results)
## Model Variables R_Squared Adj_R2 Time AIC BIC
## 1 Backward 44 0.9886593 0.9867319 11.28285 secs 3793.543 3959.760
## 2 Stepwise 23 0.9886212 0.9877076 16.49312 secs 3752.539 3841.189
## Holdout_MSE
## 1 89101.99
## 2 69494.02
Base on the table above, it can be found that:
Model Complexity: The Backward Selection model is more complex,retained 44 variables, while the Stepwise Selection model is simpler but efficient, including fewer predictors retained 23 variables, resulting in a more parsimonious model, which may improve interpretability and generalizability.
Fit Quality: Both models have almost identical R² values, showing similar explanatory power. However, Stepwise has a higher adjusted R², which accounts for model complexity
Model Selection Criteria (AIC & BIC): Both AIC and BIC are lower for the Stepwise model, suggesting it achieves better fit with fewer variables.That means it strikes a better balance between model complexity and goodness of fit.
Holdout MSE : The Stepwise model has a much lower MSE on the holdout set, suggesting better predictive performance and generalizability.
Computation Time: Backward was faster, but given the other metrics, this difference (~3.5s) is likely negligible for most practical purposes.
The Stepwise model is preferred overall: it’s simpler, generalizes better, and has better performance on most criteria (especially adjusted R², AIC, BIC, and holdout MSE), despite being a little slower to compute.
# Use the same training/testing split as previously
# Set cross-validation parameters
set.seed(123)
# Create model matrix (excluding V105)
X_train <- model.matrix(V104 ~ . - V105, data = train_data)
y_train <- train_data$V104
X_test <- model.matrix(V104 ~ . - V105, data = test_data)
y_test <- test_data$V104
### ---- (i) Ridge Regression (alpha = 0) ---- ###
start_ridge <- Sys.time()
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0)
end_ridge <- Sys.time()
time_ridge <- end_ridge - start_ridge
ridge_best_lambda <- cv_ridge$lambda.min
best_ridge_mod <- glmnet(X_train, y_train, alpha = 0, lambda = ridge_best_lambda)
ridge_pred <- predict(best_ridge_mod, newx = X_test)
ridge_mse <- mean((ridge_pred - y_test)^2)
### ---- (ii) LASSO Regression (alpha = 1) ---- ###
start_lasso <- Sys.time()
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1)
end_lasso <- Sys.time()
time_lasso <- end_lasso - start_lasso
lasso_best_lambda <- cv_lasso$lambda.min
best_lasso_mod <- glmnet(X_train, y_train, alpha = 1, lambda = lasso_best_lambda)
lasso_pred <- predict(best_lasso_mod, newx = X_test)
lasso_mse <- mean((lasso_pred - y_test)^2)
### ---- Cross-validation plot ---- ###
par(mfrow = c(1, 2))
plot(cv_ridge, main = "", xlab = "log(Lambda)", ylab = "CV Error")
title(main = "Ridge: Cross-Validation", line = 2.5)
plot(cv_lasso, main = "", xlab = "log(Lambda)", ylab = "CV Error")
title(main = "LASSO: Cross-Validation", line = 2.5)
par(mfrow = c(1, 1))
### --- Model Comparison Table --- ###
model_comparison <- data.frame(
Model = c("Ridge", "LASSO","Backward", "Stepwise"),
Time = c(time_ridge, time_lasso, backward_time, stepwise_time),
`CV MSE` = c(min(cv_ridge$cvm), min(cv_lasso$cvm), NA, NA),
`Holdout MSE` = c(ridge_mse, lasso_mse, backward_mse, stepwise_mse),
Predictors = c(length(coef(best_ridge_mod))-1, length(which(coef(best_lasso_mod)!=0))-1,
length(coef(backward_model))-1, length(coef(stepwise_model))-1)
)
print(model_comparison)
## Model Time CV.MSE Holdout.MSE Predictors
## 1 Ridge 0.241905 secs 48080.05 92275.39 108
## 2 LASSO 0.119885 secs 27436.48 61648.42 34
## 3 Backward 11.282853 secs NA 89101.99 43
## 4 Stepwise 16.493119 secs NA 69494.02 22
From the comparison of the four models (Ridge, LASSO, Backward, and Stepwise selection), LASSO demonstrates the best overall performance in this case, and here’s why:
Lower Prediction Error:
LASSO achieves the lowest Holdout MSE (61648.42 ) compared to Ridge (92275.39), Backward (89101.99 ), and Stepwise (69494.02). This indicates LASSO generalizes better to unseen data.Its lower CV MSE (27436.48 ) (compared to Ridge’s 48080.05) further validates its robustness.
Computational Efficiency:
LASSO runs much faster (0.125 seconds) than Backward (9.823s) and Stepwise (14.149s), making it scalable for large datasets.Ridge (0.240s) is also fast but performs worse in prediction accuracy.
Variable Sparsity:
LASSO retains only 40 predictors (vs. Ridge’s 108 and Backward’s 43), reducing overfitting risk while maintaining interpretability.Ridge Model keeps all the predictors (108 here), but makes their effects smaller.Stepwise selects fewer predictors (22) but has higher Holdout MSE than LASSO, suggesting it might discard useful variables.
Why LASSO Outperforms Backward/Stepwise:
Automatic Feature Selection: LASSO’s L1 regularization shrinks irrelevant coefficients to exact zero, avoiding the “greedy” pitfalls of stepwise methods, which can get stuck in suboptimal variable subsets.
Handling Multicollinearity: LASSO and Ridge handle correlated predictors better than stepwise methods, which often fail in high-dimensional settings.
While Stepwise is simpler to interpret, its computational cost and higher error make it less practical here.Ridge retains all variables, leading to higher complexity and overfitting.LASSO balances accuracy, speed, and simplicity effectively, making it the superior choice for this dataset. Its ability to select key predictors while minimizing prediction error aligns well with the observed data patterns.
Step 1: Data Preparation
# Load the Parkinson's dataset
parkinsons_data <- read.csv("../data/parkinsons.csv", header = TRUE)[-1]
# Function to split and scale data
prepare_data <- function(data, seed) {
set.seed(seed)
n <- nrow(data)
train_indices <- sample(1:n, 30)
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
X_train <- as.matrix(train_data[, paste0("X", 1:97)])
y_train <- train_data$UPDRS
X_test <- as.matrix(test_data[, paste0("X", 1:97)])
y_test <- test_data$UPDRS
X_train_scaled <- scale(X_train)
X_test_scaled <- scale(X_test,
center = attr(X_train_scaled, "scaled:center"),
scale = attr(X_train_scaled, "scaled:scale"))
return(list(
X_train = X_train_scaled,
y_train = y_train,
X_test = X_test_scaled,
y_test = y_test
))
}
lm_data <- prepare_data(parkinsons_data, 123)
# Fit a linear model
lm_model <- lm(lm_data$y_train ~ lm_data$X_train)
#summary(lm_model)
html_summary(lm_model, title = "trial")
Call:
lm(formula = lm_data$y_train ~ lm_data$X_train)
Residuals:
ALL 30 residuals are 0: no residual degrees of freedom!
Coefficients: (68 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.14 NaN NaN NaN
lm_data$X_trainX1 -555.71 NaN NaN NaN
lm_data$X_trainX2 -34.59 NaN NaN NaN
lm_data$X_trainX3 585.96 NaN NaN NaN
lm_data$X_trainX4 -153.57 NaN NaN NaN
lm_data$X_trainX5 -599.82 NaN NaN NaN
lm_data$X_trainX6 -24.87 NaN NaN NaN
lm_data$X_trainX7 51.83 NaN NaN NaN
lm_data$X_trainX8 170.76 NaN NaN NaN
lm_data$X_trainX9 -225.87 NaN NaN NaN
lm_data$X_trainX10 28.81 NaN NaN NaN
lm_data$X_trainX11 173.40 NaN NaN NaN
lm_data$X_trainX12 -86.58 NaN NaN NaN
lm_data$X_trainX13 31994.20 NaN NaN NaN
lm_data$X_trainX14 6745.51 NaN NaN NaN
lm_data$X_trainX15 1379.28 NaN NaN NaN
lm_data$X_trainX16 -32628.15 NaN NaN NaN
lm_data$X_trainX17 -3149.37 NaN NaN NaN
lm_data$X_trainX18 188.29 NaN NaN NaN
lm_data$X_trainX19 713.14 NaN NaN NaN
lm_data$X_trainX20 -239.04 NaN NaN NaN
lm_data$X_trainX21 -283.17 NaN NaN NaN
lm_data$X_trainX22 278.35 NaN NaN NaN
lm_data$X_trainX23 489.80 NaN NaN NaN
lm_data$X_trainX24 -125.69 NaN NaN NaN
lm_data$X_trainX25 -32067.51 NaN NaN NaN
lm_data$X_trainX26 -6759.54 NaN NaN NaN
lm_data$X_trainX27 -1427.30 NaN NaN NaN
lm_data$X_trainX28 32511.82 NaN NaN NaN
lm_data$X_trainX29 3023.98 NaN NaN NaN
lm_data$X_trainX30 NA NA NA NA
lm_data$X_trainX31 NA NA NA NA
lm_data$X_trainX32 NA NA NA NA
lm_data$X_trainX33 NA NA NA NA
lm_data$X_trainX34 NA NA NA NA
lm_data$X_trainX35 NA NA NA NA
lm_data$X_trainX36 NA NA NA NA
lm_data$X_trainX37 NA NA NA NA
lm_data$X_trainX38 NA NA NA NA
lm_data$X_trainX39 NA NA NA NA
lm_data$X_trainX40 NA NA NA NA
lm_data$X_trainX41 NA NA NA NA
lm_data$X_trainX42 NA NA NA NA
lm_data$X_trainX43 NA NA NA NA
lm_data$X_trainX44 NA NA NA NA
lm_data$X_trainX45 NA NA NA NA
lm_data$X_trainX46 NA NA NA NA
lm_data$X_trainX47 NA NA NA NA
lm_data$X_trainX48 NA NA NA NA
lm_data$X_trainX49 NA NA NA NA
lm_data$X_trainX50 NA NA NA NA
lm_data$X_trainX51 NA NA NA NA
lm_data$X_trainX52 NA NA NA NA
lm_data$X_trainX53 NA NA NA NA
lm_data$X_trainX54 NA NA NA NA
lm_data$X_trainX55 NA NA NA NA
lm_data$X_trainX56 NA NA NA NA
lm_data$X_trainX57 NA NA NA NA
lm_data$X_trainX58 NA NA NA NA
lm_data$X_trainX59 NA NA NA NA
lm_data$X_trainX60 NA NA NA NA
lm_data$X_trainX61 NA NA NA NA
lm_data$X_trainX62 NA NA NA NA
lm_data$X_trainX63 NA NA NA NA
lm_data$X_trainX64 NA NA NA NA
lm_data$X_trainX65 NA NA NA NA
lm_data$X_trainX66 NA NA NA NA
lm_data$X_trainX67 NA NA NA NA
lm_data$X_trainX68 NA NA NA NA
lm_data$X_trainX69 NA NA NA NA
lm_data$X_trainX70 NA NA NA NA
lm_data$X_trainX71 NA NA NA NA
lm_data$X_trainX72 NA NA NA NA
lm_data$X_trainX73 NA NA NA NA
lm_data$X_trainX74 NA NA NA NA
lm_data$X_trainX75 NA NA NA NA
lm_data$X_trainX76 NA NA NA NA
lm_data$X_trainX77 NA NA NA NA
lm_data$X_trainX78 NA NA NA NA
lm_data$X_trainX79 NA NA NA NA
lm_data$X_trainX80 NA NA NA NA
lm_data$X_trainX81 NA NA NA NA
lm_data$X_trainX82 NA NA NA NA
lm_data$X_trainX83 NA NA NA NA
lm_data$X_trainX84 NA NA NA NA
lm_data$X_trainX85 NA NA NA NA
lm_data$X_trainX86 NA NA NA NA
lm_data$X_trainX87 NA NA NA NA
lm_data$X_trainX88 NA NA NA NA
lm_data$X_trainX89 NA NA NA NA
lm_data$X_trainX90 NA NA NA NA
lm_data$X_trainX91 NA NA NA NA
lm_data$X_trainX92 NA NA NA NA
lm_data$X_trainX93 NA NA NA NA
lm_data$X_trainX94 NA NA NA NA
lm_data$X_trainX95 NA NA NA NA
lm_data$X_trainX96 NA NA NA NA
lm_data$X_trainX97 NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 29 and 0 DF, p-value: NA
The R-squared value of 1 indicates a perfect fit on the training data. However, this is likely due to overfitting, as the model is using 97 predictors with only 30 observations. In such cases, the model memorizes the data instead of learning meaningful relationships, which results in poor generalization to new data.
# General LASSO function to fit and return key results
lasso_regression <- function(X_train, y_train, X_test, y_test) {
# Define lambda grid for tuning
grid <- 10^seq(3, -1, length.out = 100)
# Perform Leave-One-Out Cross-Validation
lasso_cv <- cv.glmnet(X_train, y_train, alpha = 1, lambda = grid,
nfolds = nrow(X_train), thresh = 1e-10)
# Extract the best lambda value
optimal_lambda <- lasso_cv$lambda.min
# Fit final LASSO model using the best lambda
lasso_model <- glmnet(X_train, y_train, alpha = 1, lambda = optimal_lambda)
# Predict on the test set and compute mean squared error
y_pred <- predict(lasso_model, newx = X_test)
test_error <- mean((y_test - y_pred)^2)
# Get the names of selected (non-zero) features
coef_full <- coef(lasso_model)
selected_features <- rownames(coef_full)[coef_full[, 1] != 0][-1]
# Build the final model formula as a string
formula_terms <- sprintf("%.4f * %s", coef_full[selected_features, 1], selected_features)
formula_str <- paste("UPDRS =", round(coef_full[1,1], 4), "+", paste(formula_terms, collapse = " + "))
# Return results
return(list(
lambda = optimal_lambda,
test_error = test_error,
selected_features = selected_features,
n_nonzero = length(selected_features),
formula = formula_str,
cv_plot = lasso_cv
))
}
# Prepare data and fit the model
data123 <- prepare_data(parkinsons_data, 123)
result123 <- lasso_regression(data123$X_train, data123$y_train, data123$X_test, data123$y_test)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold
# Plot cross-validation errors vs. log(lambda)
plot(result123$cv_plot)
# Print key results
cat("Best lambda:", result123$lambda, "\n")
## Best lambda: 1.023531
cat("Test set MSE:", result123$test_error, "\n")
## Test set MSE: 20.41405
cat("Number of selected features:", result123$n_nonzero, "\n")
## Number of selected features: 4
cat("Selected features:", paste(result123$selected_features, collapse = ", "), "\n")
## Selected features: X9, X82, X83, X97
cat("LASSO model equation:\n", result123$formula, "\n")
## LASSO model equation:
## UPDRS = 26.1423 + 0.2039 * X9 + 0.0586 * X82 + 0.3937 * X83 + 7.0572 * X97
The LASSO regression identified a lambda value (≈1.02) that minimized cross-validated MSE, resulting in a model with only four non-zero coefficients: X9, X82, X83, and X97. This demonstrates strong feature selection capability, improving interpretability and reducing overfitting risks. The resulting formula list below: \[ \text{UPDRS} = 26.1423 + 0.2039 \times X_9 + 0.0586 \times X_{82} + 0.3937 \times X_{83} + 7.0572 \times X_{97} \] With LASSO, the final model includes only four key predictors, which significantly improves interpretability. The coefficients provide direct insight into how each selected feature contributes to the UPDRS score. This sparse model performs well (MSE ≈ 20.4), showing a good balance between accuracy and simplicity.
# Prepare data with a different seed and fit the model
data321 <- prepare_data(parkinsons_data, 321)
result321 <- lasso_regression(data321$X_train, data321$y_train, data321$X_test, data321$y_test)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold
# Plot cross-validation result
plot(result321$cv_plot)
# Print key results
cat("Best lambda:", result321$lambda, "\n")
## Best lambda: 0.9326033
cat("Test set MSE:", result321$test_error, "\n")
## Test set MSE: 4.899441
cat("Number of selected features:", result321$n_nonzero, "\n")
## Number of selected features: 3
cat("Selected features:", paste(result321$selected_features, collapse = ", "), "\n")
## Selected features: X83, X95, X97
cat("LASSO model equation:\n", result321$formula, "\n")
## LASSO model equation:
## UPDRS = 25.4929 + 0.2028 * X83 + 0.2533 * X95 + 9.0117 * X97
# Build a comparison table of the two models
comparison_table <- data.frame(
Seed = c(123, 321),
Optimal_Lambda = c(result123$lambda, result321$lambda),
Test_MSE = c(result123$test_error, result321$test_error),
Num_Selected_Features = c(result123$n_nonzero, result321$n_nonzero),
Selected_Features = c(
paste(result123$selected_features, collapse = ", "),
paste(result321$selected_features, collapse = ", ")
)
)
# Display the table
print(comparison_table)
## Seed Optimal_Lambda Test_MSE Num_Selected_Features Selected_Features
## 1 123 1.0235310 20.414054 4 X9, X82, X83, X97
## 2 321 0.9326033 4.899441 3 X83, X95, X97
When We repeated the analysis with a different random seed (from 123 to 321), which changed the training-test split. The test set Mean Squared Error (MSE) for the previous dataset was 20.414054, while for the new dataset, it decreased to 4.899441. This reduction indicates that the model trained with the new dataset performs better on the test set.
The final model selected a slightly different set of features. Under seed = 123, the selected features were: X9, X82, X83, X97 Under seed = 321, the selected features were: X83, X95, X97
While some features like X83 and X97 appeared in both models, others (e.g., X9, X82 and X95) were different. This indicates that the feature selection process is sensitive to the specific training-test split and that small changes in data partitioning can lead to different optimal subsets.
The final model formula list below: \[ \text{UPDRS} = 25.4929 + 0.2028 \times X_{83} + 0.2533 \times X_{95} + 9.0117 \times X_{97} \]
# Load the data
weather_data <- read.csv("../data/Weather_Station_Data_v1.csv")
# Set seed for reproducibility
set.seed(321)
# Split the data into training and test sets
train_indices <- sample(1:nrow(weather_data), size = 0.8 * nrow(weather_data))
train_data <- weather_data[train_indices, ]
test_data <- weather_data[-train_indices, ]
# Define predictors and response
X_train <- as.matrix(train_data[, -which(names(train_data) == "MEAN_ANNUAL_RAINFALL")])
y_train <- train_data$MEAN_ANNUAL_RAINFALL
X_test <- as.matrix(test_data[, -which(names(test_data) == "MEAN_ANNUAL_RAINFALL")])
y_test <- test_data$MEAN_ANNUAL_RAINFALL
# Define a sequence of alpha values
alphas <- seq(0, 1, by = 0.1)
# Perform cross-validation for each alpha
cv_results <- lapply(alphas, function(a) {
cv.glmnet(X_train, y_train, alpha = a, nfolds = 10)
})
# Find the best model based on minimum cross-validated error
best_alpha_idx <- which.min(sapply(cv_results, function(cv) min(cv$cvm)))
best_cv_mod <- cv_results[[best_alpha_idx]]
best_alpha <- alphas[best_alpha_idx]
cat("Best alpha selected is:", best_alpha, "\n")
## Best alpha selected is: 0.3
Base on the cross validation, the best alpha of Elastic Net in this dataset is 0.3, in the next step, the alpha will be set to 0.3 to obtain the best result.
lambda_min <- best_cv_mod$lambda.min
lambda_1se <- best_cv_mod$lambda.1se
# Plot the cross-validation results
plot(best_cv_mod, main = "", xlab = "Log(lambda)", ylab = "Mean Squared Error")
title(main = "ElasticNet Cross-Validation Results", line = 2)
abline(v = log(lambda_min), col = "green", lty = 2)
abline(v = log(lambda_1se), col = "red", lty = 2)
legend("topright", legend = c("lambda.min", "lambda.1se"),
col = c("green", "red"), lty = 2, bty = "n")
cat("The value of Lambda.min is: ", lambda_min, "\n")
## The value of Lambda.min is: 0.3456271
cat("The value of Lambda.1se is: ", lambda_1se, "\n")
## The value of Lambda.1se is: 20.7198
The plot shows the mean squared error (MSE) across different values
of lambda. The green line represents lambda.min (with
lowest MSE), and the red line is lambda.1se (within one
standard error for simpler model).
# Predictions
pred_min <- predict(best_cv_mod, s = "lambda.min", newx = X_test)
pred_1se <- predict(best_cv_mod, s = "lambda.1se", newx = X_test)
# Calculate MSE and RMSE
mse_min <- mean((y_test - pred_min)^2)
rmse_min <- sqrt(mse_min)
mse_1se <- mean((y_test - pred_1se)^2)
rmse_1se <- sqrt(mse_1se)
# Count non-zero coefficients
coef_min <- coef(best_cv_mod, s = "lambda.min")
coef_1se <- coef(best_cv_mod, s = "lambda.1se")
n_pred_min <- sum(coef_min != 0) - 1 # exclude intercept
n_pred_1se <- sum(coef_1se != 0) - 1
# Create summary table
comparison_table <- data.frame(
Model = c("lambda.min", "lambda.1se"),
Alpha = c(best_alpha, best_alpha),
Lambda = c(lambda_min, lambda_1se),
MSE = c(mse_min, mse_1se),
RMSE = c(rmse_min, rmse_1se),
Predictors_Used = c(n_pred_min, n_pred_1se)
)
print(comparison_table)
## Model Alpha Lambda MSE RMSE Predictors_Used
## 1 lambda.min 0.3 0.3456271 11412.89 106.8311 13
## 2 lambda.1se 0.3 20.7198017 13078.75 114.3624 9
# Extract and display coefficients
coef_table_min <- data.frame(
Variable = rownames(as.matrix(coef_min)),
Coefficient = round(as.vector(coef_min), 4)
)
coef_table_1se <- data.frame(
Variable = rownames(as.matrix(coef_1se)),
Coefficient = round(as.vector(coef_1se), 4)
)
cat("Coefficients for lambda.min:\n")
## Coefficients for lambda.min:
print(coef_table_min)
## Variable Coefficient
## 1 (Intercept) 1043.8279
## 2 ALTITUDE 0.1367
## 3 MEAN_ANNUAL_AIR_TEMP -44.2016
## 4 MEAN_MONTHLY_MAX_TEMP 22.8969
## 5 MEAN_MONTHLY_MIN_TEMP -5.3594
## 6 MEAN_ANNUAL_WIND_SPEED -14.1155
## 7 MEAN_CLOUD_COVER 3.7076
## 8 MEAN_ANNUAL_SUNSHINE -0.0120
## 9 MAX_MONTHLY_WIND_SPEED -8.1163
## 10 MAX_AIR_TEMP -29.2842
## 11 MAX_WIND_SPEED -0.2383
## 12 MAX_RAINFALL 20.6347
## 13 MIN_AIR_TEMP 27.2047
## 14 MEAN_RANGE_AIR_TEMP 22.3804
cat("Coefficients for lambda.1se:\n")
## Coefficients for lambda.1se:
print(coef_table_1se)
## Variable Coefficient
## 1 (Intercept) 612.3961
## 2 ALTITUDE 0.1671
## 3 MEAN_ANNUAL_AIR_TEMP -4.9645
## 4 MEAN_MONTHLY_MAX_TEMP 0.0000
## 5 MEAN_MONTHLY_MIN_TEMP -1.0373
## 6 MEAN_ANNUAL_WIND_SPEED -8.0770
## 7 MEAN_CLOUD_COVER 1.6533
## 8 MEAN_ANNUAL_SUNSHINE -0.0031
## 9 MAX_MONTHLY_WIND_SPEED 0.0000
## 10 MAX_AIR_TEMP -16.0839
## 11 MAX_WIND_SPEED 0.0000
## 12 MAX_RAINFALL 19.1592
## 13 MIN_AIR_TEMP 9.3296
## 14 MEAN_RANGE_AIR_TEMP 0.0000
Based on the comparison of the two models using lambda.min and lambda.1se,We evaluated two elastic net regression models with alpha = 0.3, using lambda.min and lambda.1se. The model with lambda.min had a lower MSE of 11,412.89 and RMSE of 106.83, while the lambda.1se model had a slightly higher MSE of 13,078.75 and RMSE of 114.36.
In terms of complexity, the lambda.min model used 13 predictors, whereas the lambda.1se model used only 9 predictors, indicating stronger regularization.
This shows a typical trade-off: lambda.min provides better predictive accuracy, but with more variables in the model. lambda.1se gives a more parsimonious model (fewer predictors), which may be preferred for simplicity or interpretability, even at the cost of slightly higher error.
In real-world applications, especially when working with environmental or climate-related data, interpretability and generalizability are often more important than marginal gains in accuracy. A simpler model is also easier to communicate and deploy.
Therefore, the lambda.1se model strikes a better balance between performance and model simplicity.
The formula for the final model is as follows: \[ \begin{aligned} \hat{Y} =\ & 612.3961 \\ &+ 0.1671 \cdot \text{ALTITUDE} \\ &- 4.9645 \cdot \text{MEAN\_ANNUAL\_AIR\_TEMP} \\ &- 1.0373 \cdot \text{MEAN\_MONTHLY\_MIN\_TEMP} \\ &- 8.0770 \cdot \text{MEAN\_ANNUAL\_WIND\_SPEED} \\ &+ 1.6533 \cdot \text{MEAN\_CLOUD\_COVER} \\ &- 0.0031 \cdot \text{MEAN\_ANNUAL\_SUNSHINE} \\ &- 16.0839 \cdot \text{MAX\_AIR\_TEMP} \\ &+ 19.1592 \cdot \text{MAX\_RAINFALL} \\ &+ 9.3296 \cdot \text{MIN\_AIR\_TEMP} \end{aligned} \]